DISCRETE DYNAMIC PROGRAMMING

By David Blackwell
University of California, Berkeley

1. Introduction and summary. We consider a system with a finite number \(S\) of
states \(s\), labeled by the integers \(1, 2, \ldots, S\). Periodically, say once
a day, we observe the current state of the system, and then choose an action
\(a\) from a finite set \(A\) of possible actions. As a joint result of the
current state \(s\) and the chosen action \(a\), two things happen: (1) we
receive an immediate income \(i(s, a)\) and (2) the system moves to a new state
\(s'\), with the probability of a particular new state \(s'\) given by a function
\(q = q(s' \mid s, a)\). Finally there is specified a discount factor \(\beta\),
\(0 \le \beta < 1\), so that the value of unit income \(n\) days in the future
is \(\beta^n\). Our problem is to choose a policy which maximizes our total
expected income. This problem, which is an interesting special case of the
general dynamic programming problem, has been solved by Howard in his excellent
book [3]. The case \(\beta = 1\), also studied by Howard, is substantially more
difficult. We shall obtain in this case results slightly beyond those of Howard,
though still not complete. Our method, which treats \(\beta = 1\) as a limiting
case of \(\beta < 1\), seems rather simpler than Howard's.

2. Definitions and notation. Denote by \(F\) the (finite) set of functions \(f\)
from \(S\) to \(A\). By a policy \(\pi\), we mean a sequence
\(\{f_n,\ n = 1, 2, \ldots\}\) of functions \(f_n \in F\). Using policy \(\pi\)
means that, if we find the system in state \(s\) on the \(n\)th day, the action
chosen that day is \(f_n(s)\). For any sequence \(g_1, \ldots, g_N\),
\(g_n \in F\), and any policy \(\pi = \{f_n\}\), we denote by
\((g_1, \ldots, g_N, \pi)\) the policy \(\{h_n\}\) with \(h_n = g_n\),
\(1 \le n \le N\), \(h_n = f_{n-N}\), \(n > N\). For any \(g \in F\), we denote by
\((g^{(N)}, \pi)\) the policy \(\{h_n\}\) with \(h_n = g\), \(1 \le n \le N\),
\(h_n = f_{n-N}\), \(n > N\), and by \(g^{(\infty)}\) the policy \(\{h_n\}\) with
\(h_n = g\) for all \(n\). Finally, we denote by \(T\pi\) the policy \(\{h_n\}\)
with \(h_n = f_{n+1}\) for all \(n\).

We associate with each \(f \in F\) (1) the \(S \times 1\) column vector \(r(f)\)
whose \(s\)th element is \(i(s, f(s))\), and (2) the \(S \times S\) Markov matrix
\(Q(f)\) whose \((s, s')\) element is \(q(s' \mid s, f(s))\). Thus \(r(f)\) and
\(Q(f)\) specify the income and the law of motion, as a function of the current
state, on a day when our rule of action is \(f\). If we use policy
\(\pi = \{f_n\}\) and the system is initially in state \(s\), the probability
that the system will be in state \(s'\) at the end of the \(n\)th day is the
\((s, s')\) element of the matrix \(Q_n(\pi) = Q(f_1) Q(f_2) \cdots Q(f_n)\).
Thus the total expected return from \(\pi\) is the column vector

\[ V(\pi) = \sum_{n=0}^{\infty} \beta^n Q_n(\pi)\, r(f_{n+1}), \]

where \(Q_0(\pi) = I\), the \(S \times S\) identity matrix. We have

\[
\begin{aligned}
V(\pi) &= r(f_1) + \beta Q(f_1) \sum_{n=1}^{\infty} \beta^{n-1} Q_{n-1}(T\pi)\, r(f_{n+1}) \\
       &= r(f_1) + \beta Q(f_1) V(T\pi).
\end{aligned}
\]

We associate with each \(f \in F\) the transformation \(L(f)\) which maps the
\(S \times 1\) column vector \(w\) into \(L(f)w = r(f) + \beta Q(f) w\). Thus
\(V(f, \pi) = L(f) V(\pi)\), and
\(V(f_1, \ldots, f_N, \pi) = L(f_1) \cdots L(f_N) V(\pi)\). For any two column
vectors \(w_1, w_2\), we write \(w_1 \ge w_2\) if every coordinate of \(w_1\) is
at least as large as the corresponding coordinate of \(w_2\), and \(w_1 > w_2\)
if \(w_1 \ge w_2\) and \(w_1 \ne w_2\). Note that \(L(f)\) is monotone, i.e.,
\(w_1 \ge w_2\) implies \(L(f) w_1 \ge L(f) w_2\).
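
The operator \(L(f)\) lends itself to direct computation: for \(\beta < 1\),
applying it repeatedly to any starting vector approaches \(V(f^{(\infty)})\). The
following Python sketch is only an illustration under stated assumptions. The
function names `apply_L` and `value_by_iteration` and the number of sweeps are
hypothetical choices made here, and the data are those of Example 1 at the end
of the paper with the rule \(f(1) = 1\).

```python
def apply_L(r, Q, beta, w):
    """Return L(f)w = r(f) + beta * Q(f) w for one rule of action f."""
    n = len(r)
    return [r[s] + beta * sum(Q[s][t] * w[t] for t in range(n)) for s in range(n)]


def value_by_iteration(r, Q, beta, sweeps=200):
    """Approximate V(f^(inf)) by applying L(f) repeatedly to the zero vector."""
    w = [0.0] * len(r)
    for _ in range(sweeps):
        w = apply_L(r, Q, beta, w)
    return w


# Example 1, rule f: in state 1 earn 1 and stay with probability .5;
# state 2 pays nothing and is never left.
r_f = [1.0, 0.0]
Q_f = [[0.5, 0.5], [0.0, 1.0]]
beta = 0.9
V = value_by_iteration(r_f, Q_f, beta)
print(V[0], 2 / (2 - beta))  # both close to 1.818...
```

The loop uses only the recursion \(V(f, \pi) = L(f)V(\pi)\); monotonicity of
\(L(f)\) can be checked on the same data by comparing `apply_L` on two ordered
vectors.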

For any two policies \(\pi_1, \pi_2\), we write \(\pi_1 \ge \pi_2\) if
\(V(\pi_1) \ge V(\pi_2)\), and \(\pi_1 > \pi_2\) if \(V(\pi_1) > V(\pi_2)\). A
policy \(\pi^*\) is called optimal if \(\pi^* \ge \pi\) for all \(\pi\).

3. Optimal policies for \(\beta < 1\). The methods of this section are familiar
to workers in dynamic programming, from the work of Dvoretzky, Kiefer, and
Wolfowitz [2], Karlin [4], and Bellman [1].

Theorem 1. If \(\pi^* \ge (f, \pi^*)\) for all \(f \in F\), then \(\pi^*\) is
optimal.

Proof. Our hypothesis is that

\[ L(f) V(\pi^*) \le V(\pi^*) \quad \text{for all } f \in F. \]

Then for any policy \(\pi = \{f_n\}\), we have
\(L(f_N) V(\pi^*) \le V(\pi^*)\), so that, using the monotonicity of
\(L(f_1) \cdots L(f_{N-1})\),

\[ L(f_1) \cdots L(f_N) V(\pi^*) \le L(f_1) \cdots L(f_{N-1}) V(\pi^*), \]

i.e., \((f_1, \ldots, f_N, \pi^*) \le (f_1, \ldots, f_{N-1}, \pi^*)\). Thus

\[ \pi^* \ge (f_1, \ldots, f_N, \pi^*) \]

for all \(N\), i.e., \(V(\pi^*) \ge V(f_1, \ldots, f_N, \pi^*)\) for all \(N\).
Letting \(N \to \infty\) we obtain (\(\beta < 1\)),

\[ V(\pi^*) \ge V(\pi), \]

and the proof is complete.

Theorem 2. If \((f, \pi) > \pi\), then \(f^{(\infty)} > \pi\).

Proof. Our hypothesis is \(L(f) V(\pi) > V(\pi)\). Applying the monotone
operator \(L^{N-1}(f)\) yields

\[ L^N(f) V(\pi) \ge L^{N-1}(f) V(\pi), \]

so that, by induction on \(N\), \((f^{(N)}, \pi) \ge (f, \pi)\) for all
\(N \ge 1\). Letting \(N \to \infty\) yields \(f^{(\infty)} \ge (f, \pi)\), so
that \(f^{(\infty)} > \pi\).

Our principal result, describing the Howard policy improvement routine for
\(\beta < 1\), is

Theorem 3. Take any \(f \in F\). For each \(s \in S\) denote by \(G(s, f)\) the
set of all \(a\) for which

\[ i(s, a) + \beta\, p(s, a) V(f^{(\infty)}) > V_s(f^{(\infty)}), \]

where \(p(s, a)\) is the \(1 \times S\) row vector whose \(s'\)th coordinate is
\(q(s' \mid s, a)\) and \(V_s(f^{(\infty)})\) denotes the \(s\)th coordinate of
\(V(f^{(\infty)})\). If \(G(s, f)\) is empty for all \(s\), then
\(f^{(\infty)}\) is optimal. For any \(g\) such that

(a) \(g(s) \in G(s, f)\) for some \(s\) and

(b) \(g(s) = f(s)\) whenever \(g(s) \notin G(s, f)\),

we have \(g^{(\infty)} > f^{(\infty)}\).

Proof. The \(s\)th coordinate of \(V(g, f^{(\infty)})\) is
\(i(s, g(s)) + \beta\, p(s, g(s)) V(f^{(\infty)})\). This will exceed
\(V_s(f^{(\infty)})\) if and only if \(g(s) \in G(s, f)\), and will equal
\(V_s(f^{(\infty)})\) if \(g(s) = f(s)\). Thus if \(G(s, f)\) is empty for all
\(s\), \(f^{(\infty)} \ge (g, f^{(\infty)})\) for all \(g\), so that, from
Theorem 1, \(f^{(\infty)}\) is optimal. On the other hand, for any \(g\)
satisfying (a) and (b), we have \((g, f^{(\infty)}) > f^{(\infty)}\), so that,
from Theorem 2, \(g^{(\infty)} > f^{(\infty)}\).

Call a policy \(\pi = \{f_n\}\) stationary if \(f_n\) is independent of \(n\),
i.e., if \(\pi = f^{(\infty)}\) for some \(f \in F\). As a consequence of
Theorem 3, we have the

Corollary. There is an optimal policy which is stationary.

Proof. According to Theorem 3, if we take any stationary policy
\(f^{(\infty)}\), either it is optimal (case \(G(s, f)\) empty for all \(s\)) or
it has a stationary improvement \(g^{(\infty)}\) (case \(G(s, f)\) nonempty for
some \(s\)). Since there are only finitely many stationary policies, there is
one which has no stationary improvement, so that it must be optimal.
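
The improvement routine of Theorem 3 can be sketched in Python as follows. This
is a minimal illustration rather than the paper's own formulation: the names
`solve`, `value`, `improve` and `policy_iteration`, the zero-based numbering of
states and actions, and the rule of switching to the first improving action
found are all assumptions made here. Exact rational arithmetic
(`fractions.Fraction`) keeps the strict inequality that defines \(G(s, f)\) free
of rounding error. The data are those of Example 1 with \(\beta = 1/2\).

```python
from fractions import Fraction as Fr


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A assumed nonsingular)."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        p = next(i for i in range(c, n) if M[i][c] != 0)
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for i in range(n):
            factor = M[i][c]
            if i != c and factor != 0:
                M[i] = [vi - factor * vc for vi, vc in zip(M[i], M[c])]
    return [M[i][n] for i in range(n)]


def value(f, income, q, beta):
    """V(f^(inf)), the solution of (I - beta Q(f)) V = r(f)."""
    S = len(f)
    A = [[(1 if s == t else 0) - beta * q[s][f[s]][t] for t in range(S)]
         for s in range(S)]
    return solve(A, [income[s][f[s]] for s in range(S)])


def improve(f, income, q, beta):
    """One improvement step: move to an action in G(s, f) wherever one exists."""
    S = len(f)
    V = value(f, income, q, beta)
    g = list(f)
    for s in range(S):
        for a in range(len(income[s])):
            trial = income[s][a] + beta * sum(q[s][a][t] * V[t] for t in range(S))
            if trial > V[s]:          # a belongs to G(s, f)
                g[s] = a
                break
    return g, V


def policy_iteration(f, income, q, beta):
    """Repeat improvement steps until G(s, f) is empty for every s."""
    while True:
        g, V = improve(f, income, q, beta)
        if g == f:
            return f, V
        f = g


# Example 1 (states and actions renumbered 0 and 1).
income = [[1, 2], [0, 0]]
q = [[[Fr(1, 2), Fr(1, 2)], [0, 1]],
     [[0, 1], [0, 1]]]
best, V_best = policy_iteration([0, 0], income, q, Fr(1, 2))
print(best, V_best)
```

Because each step produces a strictly better stationary policy and there are
finitely many of them, the loop must stop; this is the argument of the Corollary
in executable form.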

4. Optimal policies for \(\beta = 1\). For the case \(\beta = 1\), the total
income from a given policy is typically infinite. We may attempt instead to
maximize the average rate of income or to find policies which are optimal for
all \(\beta\) sufficiently near 1. We shall adopt the second approach. Since
\(\beta\) is now variable, it will sometimes be desirable to exhibit the
dependence of \(V(\pi)\) and other quantities on \(\beta\); thus we shall write
\(V_\beta(\pi)\) and speak of \(\beta\)-optimal policies. Denote by
\(U(\beta)\) the expected total return from a \(\beta\)-optimal policy. We shall
say that a policy \(\pi\) is optimal if it is \(\beta\)-optimal for all
\(\beta\) sufficiently near 1, i.e., if \(V_\beta(\pi) = U(\beta)\) for all
\(\beta\) sufficiently near 1, and shall say that \(\pi\) is nearly optimal if

\[ U(\beta) - V_\beta(\pi) \to 0 \quad \text{as } \beta \to 1. \]

Our problem is then to find optimal and nearly optimal policies.

We shall need certain known facts about Markov matrices, summarized as

Lemma 1. Let \(Q\) be any \(S \times S\) Markov matrix.

(a) The sequence \((I + Q + \cdots + Q^N)/(N + 1)\) converges as
\(N \to \infty\) to a Markov matrix \(Q^*\) such that

\[ Q Q^* = Q^* Q = Q^* Q^* = Q^*. \]

(b) \(\operatorname{rank}(I - Q) + \operatorname{rank} Q^* = S\).

(c) For every \(S \times 1\) column vector \(c\), the system

\[ Qx = x, \qquad Q^* x = Q^* c \]

has a unique solution.

(d) \(I - (Q - Q^*)\) is nonsingular, and

\[ H(\beta) = \sum_{n=0}^{\infty} \beta^n (Q^n - Q^*) \to H = (I - Q + Q^*)^{-1} - Q^* \]

as \(\beta \to 1\). Moreover,

\[ H(\beta) Q^* = Q^* H(\beta) = H Q^* = Q^* H = 0 \]

and

\[ (I - Q) H = H (I - Q) = I - Q^*. \]

These facts may all be found in Kemeny and Snell [5]; we indicate the proof of
(d) only.

Proof of (d). From (a) we have, for \(n > 0\),
\(Q^n - Q^* = (Q - Q^*)^n\), so that

\[ H(\beta) = \sum_{n=0}^{\infty} \beta^n (Q - Q^*)^n - Q^* = [I - \beta (Q - Q^*)]^{-1} - Q^*, \]

i.e.,

\[ (H(\beta) + Q^*)(I - \beta (Q - Q^*)) = I, \]

i.e.,

\[ (1) \qquad (H(\beta) + Q^*)(I - Q + Q^*) = I - (1 - \beta) H(\beta) (Q - Q^*). \]

Now \((C, 1)\) summability of \(\{Q^n\}\) to \(Q^*\) implies Abel summability of
\(\{Q^n - Q^*\}\) to 0:

\[ (1 - \beta) \sum_{n=0}^{\infty} \beta^n (Q^n - Q^*) = (1 - \beta) H(\beta) \to 0 \quad \text{as } \beta \to 1. \]

Thus the matrix on the right of (1) goes to \(I\) as \(\beta \to 1\), and
\(I - Q + Q^*\) is nonsingular. Multiplying (1) by \((I - Q + Q^*)^{-1}\) and
letting \(\beta \to 1\) yields
\(H(\beta) + Q^* \to (I - Q + Q^*)^{-1}\) as \(\beta \to 1\). Verification of
the equalities asserted in (d) is straightforward.
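
The identities of Lemma 1(d) can be checked exactly on a small chain. The sketch
below is illustrative only: the helper names `matmul`, `sub`, `inv2` and
`H_beta` are hypothetical, the \(2 \times 2\) matrix \(Q\) is chosen here (it
coincides with \(Q(g)\) of Example 2), and \(Q^*\) is supplied by hand on the
assumption that, for this irreducible chain, every row of \(Q^*\) is the
stationary distribution \((1/3, 2/3)\).

```python
from fractions import Fraction as Fr


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def sub(A, B):
    """Entrywise difference A - B."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def inv2(M):
    """Inverse of a 2 x 2 matrix by the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


I = [[Fr(1), Fr(0)], [Fr(0), Fr(1)]]
Q = [[Fr(0), Fr(1)], [Fr(1, 2), Fr(1, 2)]]
# Assumed, not computed: rows of Q* equal the stationary distribution of Q.
Qs = [[Fr(1, 3), Fr(2, 3)], [Fr(1, 3), Fr(2, 3)]]

# H = (I - Q + Q*)^{-1} - Q*
H = sub(inv2(sub(I, sub(Q, Qs))), Qs)


def H_beta(beta):
    """H(beta) = [I - beta (Q - Q*)]^{-1} - Q*, as in the proof of (d)."""
    D = [[beta * v for v in row] for row in sub(Q, Qs)]
    return sub(inv2(sub(I, D)), Qs)


print(H)
print(matmul(sub(I, Q), H) == sub(I, Qs))   # (I - Q) H = I - Q*
print(H_beta(Fr(999, 1000)))                # close to H
```

Using the closed form for \(H(\beta)\) rather than a truncated series keeps the
check exact for every \(\beta < 1\).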

Our results for \(\beta = 1\) are summarized as Theorem 4 below. We shall
sometimes, to simplify statements, speak of "the policy \(f\)" when we mean the
policy \(f^{(\infty)}\). For example, we write \(V_\beta(f)\) instead of
\(V_\beta(f^{(\infty)})\).

Theorem 4. Take any \(f \in F\) and denote by \(Q^*(f)\) the matrix \(Q^*\)
associated with \(Q(f)\). Then

(a) \(V_\beta(f) = [x(f)/(1 - \beta)] + y(f) + \epsilon(\beta, f)\),

where \(x(f)\) is the unique solution of

\[ (I - Q(f)) x = 0, \qquad Q^*(f) x = Q^*(f) r(f), \]

\(y(f)\) is the unique solution of

\[ (I - Q(f)) y = r(f) - x(f), \qquad Q^*(f) y = 0, \]

and \(\epsilon(\beta, f) \to 0\) as \(\beta \to 1\).

(b) For each \(s\), denote by \(G(s, f)\) the set of \(a\) for which either

\[ p(s, a) x(f) > x_s(f) \]

or

\[ p(s, a) x(f) = x_s(f) \]

and

\[ i(s, a) + p(s, a) y(f) > x_s(f) + y_s(f), \]

where \(x_s(f)\), \(y_s(f)\) denote the \(s\)th coordinates of \(x(f)\),
\(y(f)\). For any \(g\) such that \(g(s) \in G(s, f)\) for some \(s\) and
\(g(s) = f(s)\) whenever \(g(s) \notin G(s, f)\), \(g > f\) for all \(\beta\)
sufficiently near 1.

(c) For each \(s\), denote by \(E(s, f)\) the set of \(a\) for which

\[ p(s, a) x(f) = x_s(f) \]

and

\[ i(s, a) + p(s, a) y(f) = x_s(f) + y_s(f) \]

(always \(f(s) \in E(s, f)\)). If, for each \(s\), \(G(s, f)\) is empty and
\(E(s, f)\) contains only the point \(f(s)\), then \(f\) is optimal.

(d) If for each \(s\), \(G(s, f)\) is empty and \(g(s) \in E(s, f)\) for all
\(s\) implies

\[ Q^*(g) Q^*(f) = Q^*(g), \]

then \(f\) is nearly optimal.

(e) For any \(f_0\) for which \(G(s, f_0)\) is empty for all \(s\),
\(x(f_0) \ge x(g)\) for all \(g\). Denote by \(F^*\) the set of all \(g\) such
that \(x(g) = x(f_0)\). There is an \(f^* \in F^*\) with \(y(f^*) \ge y(g)\) for
all \(g \in F^*\). The nearly optimal \(g\)'s are exactly those for which
\(x(g) = x(f^*)\) and \(y(g) = y(f^*)\).

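Proof. For (a), we have

\[
\begin{aligned}
V_\beta(f^{(\infty)}) &= [I - \beta Q(f)]^{-1} r(f) = \sum_{n=0}^{\infty} \beta^n Q^n(f)\, r(f) \\
&= \Bigl( \sum_{n=0}^{\infty} \beta^n Q^*(f) + \sum_{n=0}^{\infty} \beta^n \bigl(Q^n(f) - Q^*(f)\bigr) \Bigr) r(f) \\
&= \frac{Q^*(f) r(f)}{1 - \beta} + H(f) r(f) + \bigl(H(\beta, f) - H(f)\bigr) r(f).
\end{aligned}
\]

Thus (a) is established, with \(x(f) = Q^*(f) r(f)\), \(y(f) = H(f) r(f)\), and
\(\epsilon(\beta, f) = (H(\beta, f) - H(f)) r(f)\). For the rest of the theorem,
we simply calculate \(V_\beta(g, f^{(\infty)})\), using the representation (a),
and ask when, for \(\beta\) near 1, this exceeds \(V_\beta(f^{(\infty)})\). We
have

\[
\begin{aligned}
(2) \qquad V_\beta(g, f^{(\infty)}) &= r(g) + \beta Q(g) V_\beta(f^{(\infty)}) \\
&= \frac{Q(g) x(f)}{1 - \beta} + r(g) - Q(g) x(f) + Q(g) y(f) + \epsilon_1(\beta, f, g),
\end{aligned}
\]

where
\(\epsilon_1(\beta, f, g) = -(1 - \beta) Q(g) y(f) + \beta Q(g) \epsilon(\beta, f) \to 0\)
as \(\beta \to 1\).

We see that \(g(s) \in G(s, f)\) implies that, for \(\beta\) near 1, the
\(s\)th coordinate of \(V_\beta(g, f^{(\infty)})\) exceeds that of
\(V_\beta(f^{(\infty)})\). Since \(g(s) = f(s)\) implies equality of the
\(s\)th coordinates of \(V_\beta(g, f^{(\infty)})\) and
\(V_\beta(f^{(\infty)})\) for all \(\beta\), we obtain (b) at once from
Theorem 3. Similarly, the hypotheses of (c) imply that, for all \(\beta\) near
1,

\[ V_\beta(g, f) \le V_\beta(f) \]

(with strict inequality unless \(g = f\)), so that from Theorem 3 \(f\) is
optimal.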

For (d) we shall need

Lemma 2. For any \(f, g \in F\) for which \(g(s) \in E(s, f)\) for all \(s\), we
have \(x(g) = x(f)\). If in addition \(Q^*(g) Q^*(f) = Q^*(g)\), then
\(y(g) = y(f)\).

Proof of Lemma 2. That \(g(s) \in E(s, f)\) for all \(s\) is equivalent to,
writing \(x\), \(y\) for \(x(f)\), \(y(f)\),

\[ (3) \qquad Q(g) x = x \]

and

\[ (4) \qquad r(g) + Q(g) y = x + y. \]

Multiplying (4) by \(Q^*(g)\) yields

\[ (5) \qquad Q^*(g) r(g) = Q^*(g) x. \]

But (3) and (5) have the unique solution \(x = x(g)\), so that
\(x(g) = x(f)\). Also from \(Q^*(f) y = 0\) we obtain
\(Q^*(g) Q^*(f) y = 0\), so that, if \(Q^*(g) Q^*(f) = Q^*(g)\), we obtain

\[ (6) \qquad Q^*(g) y = 0. \]

But, since \(x = x(g)\), the unique solution of (4) and (6) is \(y = y(g)\), so
that

\[ y(g) = y(f). \]

We return to (d). Let \(f\) satisfy the hypotheses of (d), and choose \(\beta\)
so near 1 that, for any pair \(f_1, f_2\), we have
\(V_\beta(f_1, f_2^{(\infty)}) \ge V_\beta(f_2^{(\infty)})\) implies
\(f_1(s) \in G(s, f_2) \cup E(s, f_2)\) for all \(s\). If our \(f\) is not
\(\beta\)-optimal, let \(f_0 = f, f_1, \ldots, f_k\) be a sequence of
\(\beta\)-improvements, obtained as in Theorem 3, terminating in a
\(\beta\)-optimal \(f_k\). Then

\[ f_{i+1}(s) \in G(s, f_i) \cup E(s, f_i) \]

for all \(i\). We show by induction on \(i\) that \(x(f_i) = x(f_0)\) and
\(y(f_i) = y(f_0)\). This is true for \(i = 0\). If true for a given \(i\),
then, since \(G(s, f)\), \(E(s, f)\) depend only on \(x(f)\), \(y(f)\), we have
\(G(s, f_i)\) is empty and \(E(s, f_i) = E(s, f)\). Then \(f\), \(f_{i+1}\)
satisfy the hypotheses of \(f\), \(g\) in Lemma 2, so that
\(x(f_{i+1}) = x(f)\), \(y(f_{i+1}) = y(f)\). Thus, writing \(f(\beta)\) for the
\(\beta\)-optimal \(f_k\), we have

\[ U(\beta) = [x(f)/(1 - \beta)] + y(f) + \epsilon(\beta, f(\beta)). \]

Since

\[ V_\beta(f^{(\infty)}) = [x(f)/(1 - \beta)] + y(f) + \epsilon(\beta, f), \]

we have \(U(\beta) - V_\beta(f^{(\infty)}) \to 0\) as \(\beta \to 1\), and
\(f^{(\infty)}\) is nearly optimal.

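To establish (e), we obtain from (2), if \(G(s, f_0)\) is empty for all \(s\),
the inequality

\[ (7) \qquad V_\beta(g, f_0^{(\infty)}) \le V_\beta(f_0^{(\infty)}) + \tau(\beta)\delta \quad \text{for } \beta \text{ near } 1, \]

where \(\tau(\beta)\) is a scalar function of \(\beta\), the maximum coordinate
of \(\epsilon_1(\beta, f_0, g) - \epsilon(\beta, f_0)\), and \(\delta\) is the
\(S \times 1\) column vector with all coordinates unity. We have
\(\tau(\beta) \to 0\) as \(\beta \to 1\). Denoting \(L_\beta(g)\) by \(L\), we
rewrite (7) as \(L V_\beta(f_0) \le V_\beta(f_0) + \tau(\beta)\delta\) for
\(\beta\) near 1. We show by induction on \(n\) that, for all \(n\),

\[ (8) \qquad L^n V_\beta(f_0) \le V_\beta(f_0) + (1 + \beta + \cdots + \beta^{n-1}) \tau(\beta)\delta \quad \text{for } \beta \text{ near } 1. \]

If (8) holds for a given \(n\), we obtain, applying \(L\),

\[
\begin{aligned}
L^{n+1} V_\beta(f_0) &\le L[\text{r.h.s. of (8)}] \\
&= r(g) + \beta Q(g) V_\beta(f_0) + \beta (1 + \beta + \cdots + \beta^{n-1}) \tau(\beta)\delta \\
&= L V_\beta(f_0) + \beta (1 + \beta + \cdots + \beta^{n-1}) \tau(\beta)\delta \\
&\le V_\beta(f_0) + [1 + \beta + \cdots + \beta^{n}] \tau(\beta)\delta,
\end{aligned}
\]

where the last inequality is obtained by using (7).

Thus, \(L^n V_\beta(f_0) \le V_\beta(f_0) + [\tau(\beta)/(1 - \beta)]\delta\)
for all \(n\), so that, for all \(g \in F\),

\[ (9) \qquad V_\beta(g) = \lim_{n \to \infty} L^n V_\beta(f_0) \le V_\beta(f_0) + [\tau(\beta)/(1 - \beta)]\delta \quad \text{for } \beta \text{ near } 1. \]

But

\[ (10) \qquad V_\beta(g) - V_\beta(f_0) = \frac{x(g) - x(f_0)}{1 - \beta} + y(g) - y(f_0) + \epsilon(\beta, g) - \epsilon(\beta, f_0). \]

(9) and (10) imply \(x(g) \le x(f_0)\).

Take any \(f^*\) which is \(\beta\)-optimal for a set of \(\beta\)'s having 1 as
a limit point. From (10), with \(g = f^*\), we obtain
\(x(f^*) \ge x(f_0)\), so that \(x(f^*) = x(f_0)\). For any \(g \in F^*\) we
have
\(V_\beta(f^*) - V_\beta(g) = y(f^*) - y(g) + \epsilon(\beta, f^*) - \epsilon(\beta, g)\),
so that, letting \(\beta \to 1\) through a sequence for which \(f^*\) is
\(\beta\)-optimal, we obtain \(y(f^*) \ge y(g)\) for all \(g \in F^*\). The last
assertion of (e) is now immediate.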

Theorem 4 does not describe an algorithm which is guaranteed to lead to optimal
or even nearly optimal policies, and which is comparable in simplicity to the
algorithm described by Theorem 3 for \(\beta < 1\). The algorithm is simple
until we reach an \(f\) for which \(G(s, f)\) is empty. At this point, if
\(E(s, f)\) contains for each \(s\) only the single element \(f(s)\), \(f\) is
optimal. If not, we know only that \(x(g) \le x(f)\) for all \(g\), so that we
have a policy which maximizes our average return. In one case the verification
of (d) is immediate. This is the case in which there is a single terminal state
\(s^*\) which is certain to be reached eventually, no matter where we start or
which policy we use, and which can never be left once reached. In this case for
every \(g\), \(Q^*(g)\) is the matrix with every row the \(s^*\) unit vector, so
that \(f\) will satisfy the hypothesis of (d) and be nearly optimal. In general,
the checking of (d) is tedious and, if it fails, we are reduced to determining
the set \(F^*\), calculating \(y(g)\) for each \(g \in F^*\), and selecting a
\(g\) for which \(y(g)\) is maximal.

Theorem 5. There is an optimal policy which is stationary.

Proof. For each \(s\) and \(f\), the \(s\)th coordinate of \(V_\beta(f)\) is a
rational function of \(\beta\), as the representation
\(V = (I - \beta Q)^{-1} r\) shows. Let \(f^*\) be \(\beta\)-optimal for a set
of \(\beta\)'s having 1 as a limit point. Then, for every \(g\),
\(V_\beta(f^*) \ge V_\beta(g)\) for a set of \(\beta\)'s having 1 as a limit
point. Since all coordinates of \(V_\beta(f^*)\) and \(V_\beta(g)\) are rational
functions of \(\beta\),

\[ V_\beta(f^*) \ge V_\beta(g) \quad \text{for all } \beta \text{ near } 1. \]

Since this holds for every \(g \in F\), \(f^*\) is optimal.

We close with two examples.

Example 1. An \(f\) which satisfies the hypotheses of (d) of Theorem 4, but is
not optimal. There are two states, 1 and 2, and two actions, 1 and 2. In state
1, action 1 yields $1, and the system remains in state 1 with probability .5 and
moves to state 2 with probability .5, while action 2 yields $2 and the system
moves to state 2 with certainty. In state 2, either action yields 0 and the
system remains in state 2. There are clearly only two effectively different
elements of \(F\): \(f\) with \(f(1) = 1\) and \(g\) with \(g(1) = 2\). We have,
starting in state 1,

\[ V_\beta(f^{(\infty)}) = 1 + \tfrac{1}{2}\beta + \tfrac{1}{4}\beta^2 + \cdots = 2/(2 - \beta), \]

\[ V_\beta(g^{(\infty)}) = 2. \]

Thus \(U(\beta) = 2\), and \(f^{(\infty)}\) is nearly optimal but not optimal.
The verification that \(f\) satisfies the hypotheses of (d) of Theorem 4 is
straightforward.
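
The gap between the two policies of Example 1 can be tabulated exactly. The
short Python sketch below is illustrative only: the function names `v_f` and
`v_g` and the sample values of \(\beta\) are choices made here, and the closed
forms are the state-1 returns of Example 1.

```python
from fractions import Fraction as Fr


def v_f(beta):
    """State-1 return of f: V = 1 + (beta / 2) V, hence V = 2 / (2 - beta)."""
    return 2 / (2 - beta)


def v_g(beta):
    """State-1 return of g: $2 at once, then nothing in state 2."""
    return Fr(2)


gaps = {}
for beta in (Fr(1, 2), Fr(9, 10), Fr(99, 100), Fr(999, 1000)):
    gaps[beta] = v_g(beta) - v_f(beta)   # U(beta) - V_beta(f) in state 1
    print(beta, gaps[beta], float(gaps[beta]))
```

The gap equals \(2(1 - \beta)/(2 - \beta)\): it is positive for every
\(\beta < 1\), so \(f\) is never \(\beta\)-optimal, yet it shrinks to 0 as
\(\beta \to 1\), which is exactly the definition of nearly optimal.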

Example 2. An \(f\) for which \(G(s, f)\) is empty for all \(s\), but which is
not nearly optimal. Again there are two states, 1 and 2, and two actions, 1 and
2. In state 1, action 1 yields $3 and the system remains in state 1 with
probability .5. Action 2 yields $6, and the system moves to state 2. In state 2,
either action loses $3 and the system remains in state 2 with probability .5 and
moves to state 1 with probability .5. Again, there are only two effectively
different elements of \(F\): \(f\) with \(f(1) = 1\) and \(g\) with
\(g(1) = 2\). Straightforward calculations yield

\[
x(f) = x(g) = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad
y(f) = \begin{pmatrix} 3 \\ -3 \end{pmatrix}, \qquad
y(g) = \begin{pmatrix} 4 \\ -2 \end{pmatrix},
\]

so that

\[ V_\beta(g) - V_\beta(f) \to \begin{pmatrix} 1 \\ 1 \end{pmatrix} \quad \text{as } \beta \to 1 \]

and \(f\) is not nearly optimal. The verification that \(G(s, f)\) is empty for
each \(s\) is straightforward.
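
The limiting behaviour in Example 2 can be checked numerically by solving
\((I - \beta Q)V = r\) for each rule. The sketch below is a hedged
illustration: the function name `value` and the sample values of \(\beta\) are
choices made here, and it assumes that under action 1 in state 1 the remaining
probability .5 is a move to state 2 (the only other state).

```python
from fractions import Fraction as Fr


def value(r, Q, beta):
    """Solve (I - beta Q) V = r for a two-state chain by Cramer's rule."""
    a, b = 1 - beta * Q[0][0], -beta * Q[0][1]
    c, d = -beta * Q[1][0], 1 - beta * Q[1][1]
    det = a * d - b * c
    return [(r[0] * d - b * r[1]) / det, (a * r[1] - c * r[0]) / det]


half = Fr(1, 2)
r_f, Q_f = [3, -3], [[half, half], [half, half]]   # f: action 1 in state 1
r_g, Q_g = [6, -3], [[0, 1], [half, half]]         # g: action 2 in state 1

for beta in (Fr(9, 10), Fr(99, 100), Fr(999, 1000)):
    Vf, Vg = value(r_f, Q_f, beta), value(r_g, Q_g, beta)
    print(beta, [float(vg - vf) for vg, vf in zip(Vg, Vf)])
```

Under these assumptions \(V_\beta(f)\) does not depend on \(\beta\), while both
coordinates of \(V_\beta(g) - V_\beta(f)\) stay near 1 as \(\beta \to 1\), so
the difference does not vanish.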